Emorie D Beck
November 1, 2019
Data come in many forms, and I don’t just mean .csv, .xls, .sav, etc. Data can be wide, long, documented, fragmented, messy, and just about anything else that you can imagine.
Failing to understand your data can lead to improper techniques and, at worst, flagrantly wrong inferences.
A good workflow starts by keeping your files organized outside of R. A research project typically involves:
You can set this up outside of R, but I’m going to quickly show you how to set it up inside R.
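As a minimal sketch of what setting up a project from inside R could look like, the snippet below creates a few folders with base R. The folder names (`data`, `scripts`, `results`) and the temporary project location are assumptions for illustration only; `data_path` simply mirrors the variable name used in the code later in this tutorial.

```r
# Hypothetical project root; a temporary directory is used so this is safe to run anywhere
project_path <- file.path(tempdir(), "my_project")

# Hypothetical folder names; swap in whatever your project needs
folders <- c("data", "scripts", "results")

# Create each folder (and the project root) if it does not exist yet
for (folder in folders) {
  dir.create(file.path(project_path, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Assumption: data_path points at the project root, which contains the data folder
data_path <- project_path

# List the folders that now exist at the top level of the project
list.dirs(project_path, full.names = FALSE, recursive = FALSE)
```

Building paths with `file.path()` rather than pasting strings together keeps the setup portable across operating systems.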
When I create an R Markdown document for my own research projects, I always start by setting up my workspace. This involves 3 steps:
Below, we will step through each of these separately, setting ourselves up to (hopefully) flawlessly communicate with R and our data.
Loading packages seems like the most basic step, but it is actually very important. ALWAYS LOAD YOUR PACKAGES IN A VERY INTENTIONAL ORDER AT THE BEGINNING OF YOUR SCRIPT. When two packages export a function with the same name, the one loaded last masks the other. Package conflicts suck, so it needs to be shouted.
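To make the point about load order concrete, here is a small sketch. The two packages (`plyr` and `dplyr`) are my choice for illustration and are assumed to be installed; they are handy because they share several function names.

```r
# Assumption: plyr and dplyr are installed; they stand in for whatever packages you use
library(plyr)   # load the package you want masked first
library(dplyr)  # loaded last, so its summarise(), mutate(), etc. take precedence

# Function names exported by both attached packages
overlap <- intersect(ls("package:plyr"), ls("package:dplyr"))
overlap

# Everything currently masked on the search path
conflicts()

# When in doubt, call the function with an explicit namespace
dplyr::summarise(mtcars, mean_mpg = mean(mpg))
```

If the two `library()` calls were swapped, a bare call to `summarise()` would use the plyr version instead, which is exactly the kind of silent change that a fixed load order protects against.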
The second step is a codebook. Arguably, this is the first step because you should create the codebook long before you open R and load your data.
In this case, we are going to use some data from the German Socioeconomic Panel Study (GSOEP), which is an ongoing panel study in Germany. Note that these data are for teaching purposes only, shared under the license for the Comprehensive SOEP teaching dataset, which I, as a contracted SOEP user, can use for teaching purposes. These data represent select cases from the full data set and should not be used for the purpose of publication. The full data are available for free at https://www.diw.de/en/diw_02.c.222829.en/access_and_ordering.html.
For this tutorial, I created the codebook for you and included what I believe are the core columns you may need. Some of these columns may not be particularly helpful for every dataset.
Here are my core columns that are based on the original data:
The reason I do it in Excel is that it makes it easier for someone who may be reviewing my codebook.
Here are additional columns that will make our lives easier or are applicable to some but not all data sets:
Below, I will demonstrate each of these.
Below, I’ll load in the codebook we will use for this study, which will include all of the above columns.
# set the path
# (assumes the tidyverse is loaded and `data_path` already points to your local project folder)
wd <- "https://github.com/emoriebeck/R-tutorials/blob/master/wustl_r_workshops/workflow"

download.file(
  url      = sprintf("%s/data/codebook.csv?raw=true", wd),
  destfile = sprintf("%s/data/codebook.csv", data_path)
)

# load the codebook
(codebook <- sprintf("%s/data/codebook.csv", data_path) %>%
  read_csv(.) %>%
  mutate(old_name = str_to_lower(old_name))) # convert the old names to lower case

First, we need to load in the data. We’re going to use the German Socioeconomic Panel Study, a longitudinal study of German households that has been conducted since 1984. Specifically, we’ll use three recent waves of personality data collected between 2005 and 2013.
Note: we will be using the teaching set of the GSOEP data set, so I will not be pulling from the raw files. I will also not be mirroring the format in which you would usually load the GSOEP, because that is slightly more complicated and something we will return to in a later tutorial on purrr, after we have more skills. I’ve left that code in the .Rmd for now, but it won’t make a lot of sense yet.
The code below shows how I would read in and rename a wide-format data set using the codebook I created.
# download the file
download.file(
  url      = sprintf("%s/data/workflow_data.csv?raw=true", wd),
  destfile = sprintf("%s/data/workflow_data.csv", data_path)
)

old.names <- codebook$old_name # get old column names
new.names <- codebook$new_name # get new column names

(soep <- sprintf("%s/data/workflow_data.csv", data_path) %>% # path to data
  read_csv(.) %>% # read in data
  select(all_of(old.names)) %>% # select the columns from our codebook
  setNames(new.names)) # rename columns with our new names
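If you want to see the select-then-rename pattern without downloading anything, the sketch below runs it on made-up data. The toy codebook, the toy data, and every variable name in them are invented for illustration (they are not GSOEP variables); only the `old_name` and `new_name` columns mirror the real codebook. It assumes dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Toy codebook: invented names, same two columns the real codebook relies on
toy_codebook <- tibble(
  old_name = c("v001", "v002", "v003"),
  new_name = c("id", "extraversion", "agreeableness")
)

# Toy wide-format data with one extra column that is not in the codebook
toy_data <- tibble(
  v001   = 1:3,
  v002   = c(4, 5, 3),
  v003   = c(2, 6, 7),
  unused = c("a", "b", "c")
)

# Sanity check: every old name in the codebook should exist in the data
missing_cols <- setdiff(toy_codebook$old_name, names(toy_data))
stopifnot(length(missing_cols) == 0)

# Keep only the codebook columns (in codebook order), then rename them
toy_renamed <- toy_data %>%
  select(all_of(toy_codebook$old_name)) %>%
  setNames(toy_codebook$new_name)

names(toy_renamed)
```

Because `setNames()` assigns names by position, this only works when the selected columns come back in the same order as the codebook rows, which `select(all_of(...))` guarantees. The `setdiff()` check is an optional extra I added; it catches codebook entries that do not match any column before `select()` fails.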